Keywords: Classification; Co-clustering; Microarray

... cancer subtypes. A supervised infinite feature selection gene selection method was then
combined with a multiclass SVM for classification of the selected genes and further biological
analysis. The analysis of breast cancer and glioblastoma multiforme provides evidence of the top
genes involved in each cancer and of the pathways associated with those top genes. The functional
analysis is useful in the medical and pharmaceutical fields for cancer diagnosis and prognosis.
1. INTRODUCTION

Basic research on abnormalities of the cancer genome has been used to categorize patients in order to
improve clinical decision making and implement more efficient treatments. Although this type of
categorization has improved the treatment of various cancers, heterogeneity among patient populations
remains a major challenge. The advancement of DNA microarray technology has permitted an extensive
understanding of genes, especially in oncology, for the onset, diagnosis and prognosis of cancers. These
diagnostics are useful for different types of cancer and lead to individual treatment plans and more
accurate estimation of clinical outcomes [1, 2].

The initial stage in organizing and investigating high-throughput gene expression datasets is to
group (cluster) them, through artificial intelligence and machine learning approaches, according to
similar biological features (genes) or conditions (samples) based on some similarity measure [3-5].
Because prior knowledge about both features and conditions is typically inadequate, clustering is
conducted as an unsupervised process that groups features and conditions [6]. Conventional
clustering is not an ideal method for complicated and heterogeneous cancers, because only certain
genes, in a subset of samples, are expressed as cancer genes in cellular processes among similar
clinical types of cancer in a specific tissue. A further limitation is that a single gene may
participate in and regulate numerous clusters and pathways under different conditions [7].
Cheng and Church [8] pioneered the application of bi-clustering to gene expression datasets.
Fundamentally, co-clustering (or bi-clustering) simultaneously clusters genes and samples to discover
subgroups of genes that display similar patterns within a certain subset of experimental conditions
[9-11]. This approach has provided unique opportunities and challenges for classifying tumors and
discovering more tumor subtype mechanisms. Researchers have also tried to identify multiple
bi-clusters at a time, for example with statistical methods [12], information theory [13], matrix
factorization [14], and graph-based bi-clustering [15]. Many co-clustering methods have been
developed. Cho and Dhillon [16] proposed minimum sum-squared residue co-clustering (MSSRCC) to
identify coherent bi-clusters. Modular Singular Value Decomposition (Mod-SVD) was proposed by Aradhya,
Masulli [17] to discover bi-clusters from SVD computation. Huang, Sun [18] developed a modified fuzzy
co-clustering (MFCC), while Hussain and Ramazan [19] proposed a method based on a co-similarity
measure between genes (and conditions).
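
To make the notion of a coherent bi-cluster concrete, the following minimal sketch computes the
mean squared residue of Cheng and Church for a chosen set of genes (rows) and samples (columns).
It is an illustration only: the function name, the toy matrix and its values are assumptions made
here, and the code does not reproduce MSSRCC or any other method cited above. A residue close to
zero means that the selected genes follow the same additive pattern across the selected samples.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the bi-cluster given by rows I and columns J."""
    n_r, n_c = len(rows), len(cols)
    # Row means, column means and overall mean, all restricted to the bi-cluster.
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_c for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_r for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / n_r
    total = 0.0
    for i in rows:
        for j in cols:
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_r * n_c)


# Toy expression matrix (made-up values): rows are genes, columns are samples.
expr = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 1.0, 5.0, 2.0],
]
# Genes 0-2 on samples 0-2 differ only by a constant shift, so the residue is ~0.
coherent = mean_squared_residue(expr, rows=[0, 1, 2], cols=[0, 1, 2])
# The whole matrix is not coherent, so its residue is clearly larger.
whole = mean_squared_residue(expr, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(coherent, whole)
```

The sum-squared residue is the same quantity without the final division by the bi-cluster size.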

Therefore, in this research we propose an improved network-assisted co-clustering for the
identification of cancer subtypes (iNCIS). In general, this method integrates prior knowledge from a
gene network with gene expression data to obtain bi-clusters.


2. RESEARCH METHOD 
2.1. Dataset and Tools 

Two cancer microarray datasets were used in this research: Breast Cancer (BRCA) and
Glioblastoma Multiforme (GBM). Both datasets are publicly available from The Cancer
Genome Atlas (TCGA); BRCA was obtained from the Network [20] and GBM from Verhaak,
Hoadley [21]. The datasets were in text file format and had been pre-formatted for use as input
to the software. They mainly comprise numerical values, with rows representing genes and columns
representing samples (patients); the classes are unknown. BRCA contains 17814 genes and 547 samples,
while GBM contains 11861 genes and 202 samples. Co-clustering and classification were performed in
MATLAB 2014a, and the Feature Selection Library (FSLib) [22] was used for feature selection in
combination with classification.


2.2. Co-clustering and Validation 

In this stage, the first step is to assign a weight to each gene using a modified PageRank algorithm.
The co-clustering (NCIS) algorithm then begins, with the objective function improved to minimize
the sum-squared residues and optimize the matrix X. The parameters are selected based on the
cophenetic correlation coefficient, and some are left at their default values [23, 24]. For
validation, silhouette analysis was performed [25]: the larger the silhouette value, the better the
clustering. In addition, subnetworks of a particular gene were obtained for both cancers and
validated.
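
The gene-weighting step can be illustrated with standard PageRank computed by power iteration on a
small undirected gene network. This is a sketch under stated assumptions: the modification used in
this research is not specified in this section, so the code below is the textbook algorithm, and
the network, the gene names (geneA to geneD) and the damping factor of 0.85 are hypothetical.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration; adjacency maps node -> list of neighbours."""
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical undirected network: each edge is listed in both directions.
network = {
    "geneA": ["geneB", "geneC", "geneD"],
    "geneB": ["geneA", "geneC"],
    "geneC": ["geneA", "geneB"],
    "geneD": ["geneA"],
}
weights = pagerank(network)
for gene in sorted(weights, key=weights.get, reverse=True):
    print(gene, round(weights[gene], 4))
```

In this toy network the most connected gene receives the largest weight, which is the intuition
behind using network position as prior knowledge for co-clustering.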


2.3. Classification and Validation 

The number of classes obtained for both datasets was then used for feature selection and
classification. The supervised infinite feature selection (SinFS) technique is combined with a
multiclass support vector machine (mSVM). This supervised step requires the class to which each gene
belongs, which we obtained from the co-clustering method. The genes with the highest ranking and the
lowest p-value under the threshold are selected for further classification. The selected top ten
genes are then subjected to functional analysis, which examines whether a gene belongs to a specific
cancer subtype or is a biomarker, oncogene, transcription factor, tumor suppressor or antigen,
together with the pathways involved.


3. RESULTS AND ANALYSIS 
This section discusses the overall results obtained from co-clustering and classification of the
two cancer gene expression datasets.


3.1. Cancer Subtypes and Subnetworks of BRCA 

Applying iNCIS to the BRCA dataset identified five cancer subtypes, which are tabulated in
Table 1 and visualized in Figure 1 (a) to (e). The subnetwork of gene ABCC8 was produced for the
five subtypes obtained from iNCIS. Visualizing the subnetwork, which shows the expression level and
the weight of each gene, helps to demonstrate the differences among the five subtypes. This
subnetwork was chosen primarily because it connects a small number of genes and is easy to present
clearly.

From the BRCA dataset, the small ABCC8 subnetwork (Figure 1) is taken as an example to demonstrate
the differences among the five subtypes. A total of 16 genes are present in this subnetwork. Gene
ABCC8 is highly expressed in the Luminal A and Luminal B subtypes and moderately expressed in the
normal-like subtype, while the triple-negative and HER2 subtypes show low expression. In addition,
the KLK11, KLK13, HDAC5 and RRAD genes are expressed at moderate to high levels in all subtypes.


Table 1. Number of samples for BRCA subtypes
Normal-like   Triple negative/basal-like   Luminal A   Luminal B   HER2-enriched   TOTAL
24            99                           160         146         118             547
Figure 1. ABCC8 subnetwork expression patterns in BRCA subtypes. Genes directly connected to ABCC8
and genes targeting ABCC8’s downstream are shown. Circle colour shows the gene expression level;
circle size is based on gene weight. (a) Normal-like; (b) Basal-like; (c) Luminal A; (d) Luminal B;
(e) HER2-enriched

3.2. Cancer Subtypes and Subnetworks of GBM

In this study, four subtypes were identified by applying the co-clustering algorithm to the GBM
dataset. Table 2 shows the subtypes obtained and their numbers of samples. Scientists from TCGA have
published the finding of four distinct subtypes of GBM: Proneural, Neural, Mesenchymal and
Classical [21, 26].


Table 2. Number of Samples for GBM Subtypes
Proneural   Neural   Mesenchymal   Classical   TOTAL
51          64       43            44          202

In this implementation, gene NPTX1 acts as the target gene, and its subnetwork was produced to
interpret the relationships among genes. Figure 2 (a) to (d) shows the four subtypes generated.
Figure 2 (a) was identified as the Proneural subtype. This subtype is very common among young adults
and is normally characterised by IDH/TP53 positivity [27, 28]. From iNCIS, 51 samples of the
Proneural subtype were obtained. In addition, the Proneural subtype mostly derives from low-grade
gliomas, which are associated with a better prognosis [27].

Figure 2. NPTX1 subnetwork expression patterns in GBM subtypes. Genes directly connected to NPTX1 and
genes targeting NPTX1’s downstream are shown. Circle colour shows the gene expression level; circle size
is based on gene weight. (a) Proneural; (b) Neural; (c) Mesenchymal; (d) Classical

On average, the silhouette plot in Figure 3 indicates optimal clusters, which supports the
identification of five cancer subtypes of BRCA. Likewise, the silhouette plot in Figure 4 indicates
optimal clusters on average, which supports the identification of four cancer subtypes of GBM.

Figure 3. Silhouette plot for the five BRCA subtypes. Each subtype shows different values
Figure 4. Silhouette plot for the four GBM subtypes. Each subtype shows different values
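
The silhouette values behind such plots can be computed as in the following minimal sketch. It
assumes Euclidean distance between samples and uses made-up two-dimensional points; the distance
measure, the data and the function name are illustrative assumptions and are not taken from the
experiments reported here. For each sample, a is the mean distance to the other members of its own
cluster and b is the smallest mean distance to the members of any other cluster.

```python
import math


def silhouette_values(points, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for every point; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # usual convention for a cluster with a single member
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Two well-separated toy groups of samples (made-up coordinates).
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_values(samples, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(samples, [0, 1, 0, 1, 0, 1])
good_mean = sum(good) / len(good)
bad_mean = sum(bad) / len(bad)
print(round(good_mean, 3), round(bad_mean, 3))
```

A labelling that matches the real grouping yields a mean silhouette close to 1, while a labelling
that mixes the groups yields a much lower value.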


3.3. Gene Analysis for BRCA and GBM 

Gene ranking plays an important role in the gene selection process. The ranking is assigned to the
genes by the proposed SinFS-mSVM method. However, a gene that obtains the highest rank in the gene
list is not guaranteed to be selected as one of the best genes.

The top ten (10) ranked genes listed in Table 3 and Table 4 are produced by the gene ranking
calculation of the proposed SinFS-mSVM method. The ranked genes are then compared against the
p-values of the genes, which are generated from the t-test and ANOVA. The ANOVA p-value is
calculated over the whole BRCA and GBM datasets. The p-values of the genes are compared, and the
genes with the lowest p-value, under a threshold of 0.05, and the highest gene weight score are
selected.
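
As an illustration of the ANOVA step, the sketch below computes the one-way ANOVA F statistic for
one gene whose expression values are grouped by subtype. The function name and the toy expression
values are assumptions made for this example. Converting F into a p-value additionally requires the
F distribution, for instance through scipy.stats.f_oneway, which is left out here to keep the
example free of external dependencies.

```python
def anova_f(groups):
    """One-way ANOVA F statistic; groups is a list of lists of expression values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean (must be non-zero).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up expression values of two genes across three subtypes.
overlapping_gene = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
separated_gene = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
f_overlapping = anova_f(overlapping_gene)
f_separated = anova_f(separated_gene)
print(f_overlapping, f_separated)
```

A larger F statistic corresponds to a smaller p-value for fixed group sizes, so a gene whose
expression separates the subtypes clearly is favoured by the p-value comparison.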


Table 3. Top 10 Genes for BRCA subtypes
Overall    Normal-like   Triple negative   Luminal A   Luminal B   HER2
NPY1R      COL17A1       BCL11A            CCL13       IGF1*       HOXB13*
CEACAM6    KRT5          ABCC8*            EGFR        JAM2        MYBL2
TFAP2B     SFRP1         NAT1              ESR1*       LAMA2       PKMYT1
UGT2B11    ID4           GRPR              IL18RAP     NDN         CDKN3
SCGB2A2    NFIB          MLPH*             LCK         SLIT2*      E2F1
CBLN2      KRT17         CA12              TBX21*      RUNX1T1     PLK1*
ROPN1B     OSR1          SCUBE2            PLA2G4A     JAM3        AURKA*
AREG       TRIM29        GFRA1*            SLAMF1      BMX         KIF4B*
ROPN1      EGFR          ESR1              PTX3        CXCL12      GTSE1
PDZK1      BIRC5         ERBB4             BCL11A      COL14A1     CX3CR1
Red: significant genes; * found particularly in the subtypes


Table 4. Top 10 genes for GBM subtypes
Overall    Proneural   Neural     Classical   Mesenchymal
RPS4Y1     CHD3        PTPRJ      ACTA2       POLB
LIF        PGM1        ACY1*      LHX1        CEP76
IL8        ADARB1      CEP27      ASGR1*      NFATC4
DKK1       DMD         CPS1       AMFR*       MAP3K14*
EGFR       OXT         PF4        RPS15A      ACTA2*
PTX3       ANGPTL4     CNTNAP2    CYBA        DNAJB5
IL13RA2    CYP7A1      DNAJB5*    CCT6A       PPARD
FABP5      EIF2B5      THAP1      INHA*       RAB17
CHI3L1     BCL2L10     PHKB       RNGTT       MYH2
MOXD1      PKLR        RPL19      RGS14*      TCERG1
Red: significant genes; * found particularly in the subtypes

The first column of both Table 3 and Table 4, named the overall set of genes, is produced by
comparing the lowest ANOVA p-value and the highest weight score. The gene list for each subtype, in
contrast, is obtained by comparing the lowest p-value from the t-test and the highest weight score.
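
One possible reading of this selection rule is sketched below: genes whose p-value falls under the
0.05 threshold are kept and then ordered by weight score, with the p-value used to break ties. This
ordering is an assumption made for illustration, since the exact way the two criteria are combined
is not spelled out in the text, and the gene names (g1 to g4), scores and p-values are hypothetical.

```python
def select_top_genes(scores, p_values, alpha=0.05, top_n=10):
    """Keep genes with p < alpha, then sort by weight score (high first), p-value (low first)."""
    significant = [gene for gene in scores if p_values[gene] < alpha]
    significant.sort(key=lambda gene: (-scores[gene], p_values[gene]))
    return significant[:top_n]


# Hypothetical weight scores and p-values for four genes.
scores = {"g1": 0.90, "g2": 0.80, "g3": 0.95, "g4": 0.50}
p_values = {"g1": 0.001, "g2": 0.200, "g3": 0.040, "g4": 0.0001}
selected = select_top_genes(scores, p_values, top_n=2)
print(selected)
```

Here g2 is discarded despite its high score because its p-value exceeds the threshold, which
mirrors the remark that a highly ranked gene is not guaranteed to be selected.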


3.4. Pathway Analysis for BRCA and GBM 

The pathways involved in each subtype of BRCA and GBM were investigated. Pathway analysis was
conducted on the genes selected for each subtype, as shown in Table 5 and Table 6 for BRCA and GBM
respectively. GeneCards (https://www.genecards.org) and KEGG: Kyoto Encyclopaedia of Genes and
Genomes (https://www.genome.jp/kegg/) were used to identify the top significant pathways for each
subtype. Pathway analysis was also conducted for the selected overall genes of both datasets.

Table 5 and Table 6 show the top enriched pathways of the common genes and subtype-specific genes
of each subtype from BRCA and GBM respectively. The pathways for both datasets have been classified
into seven types: (1) cellular process, (2) metabolism, (3) environmental information
processing, (4) nervous system, (5) cancers, (6) immune system, and (7) other organismal systems.
The first column in both tables shows the pathways obtained from the overall genes of the BRCA and
GBM datasets.

The BRCA pathways differ considerably among subtypes. The pathways for the overall genes and
samples (the first column of Table 5) are related to metabolism and cellular process. Pathways for
the triple-negative subtype are associated with cancers and metabolism, which suggests the growth of
breast cell tumors. The pathways in the Luminal A subtype are related to cancers, metabolism and the
immune system. Pathways in the Luminal B subtype are highly related to cancers and metabolism, while
pathways from the HER2 subtype are linked with cellular process and metabolism. Most of these
pathways have known links to breast cancer.


Table 5. Top 10 Pathways for BRCA Subtypes
[Table body not recovered. Columns: Overall, Triple-negative, Luminal A, Luminal B, HER2.
Surviving entries: Hematopoietic Stem (truncated), Integrated Breast Cancer Pathway, Pathways in
cancer, Endometrial cancer, EGF/EGFR Signaling. Colour legend (pathway types): Cellular process,
Immune system, Other organismal system, Cancer, Metabolism, Environmental information processing,
Nervous system.]

From Table 6, pathways for the overall top genes from GBM are linked with mixed types such as
immune system, metabolism, nervous system and cellular processes related to cell growth. The
Proneural subtype is highly involved in cellular process, metabolism and some immune system
pathways. The Neural, Classical and Mesenchymal subtypes are strongly involved in cellular process
pathways and significant metabolism pathways. Moreover, the Mesenchymal subtype involves many immune
system pathways. Most of these pathways are associated, directly or indirectly, with glioblastoma
multiforme.

Table 6. Top 10 Pathways for GBM Subtypes
[Table body not recovered. Columns: Overall, Proneural, Neural, Classical, Mesenchymal.
Surviving entries: Apelin signaling pathway, Galactose metabolism, Reelin Pathway, Pyrimidine
metabolism (KEGG), C6 deamination of adenosine, Metabolism of proteins, Lipoprotein metabolism,
Metabolism, Peptide ligand-binding receptors, NAD metabolism. Colour legend (pathway types):
Cancers, Immune systems, Other organismal system, Cellular process, Metabolism, Environmental
information processing, Nervous system.]





4. CONCLUSION 

From the results, it can be concluded that five subtypes of BRCA and four subtypes of GBM were
successfully identified. The iNCIS algorithm is able to produce simple subnetworks that show gene
expression in each subtype. Feature selection and classification made it possible to prioritize
significant genes for each subtype of both datasets, which were analyzed for disease prognosis and
diagnosis.